Abstract



A new language model for speech recognition is presented. The model develops hidden hierarchical
syntactic-like structure incrementally and uses it to extract meaningful information from the
word history, thus complementing the locality of currently used trigram models. The structured
language model (SLM) and its performance in a two-pass speech recognizer (lattice decoding)
are presented. Experiments on the WSJ corpus show an improvement in both perplexity
(PPL) and word error rate (WER) over conventional trigram models.



1 Structured Language Model 




An extensive presentation of the SLM can be found in [1]. The model assigns a probability
P(W,T) to every sentence W and its every possible binary parse T. The terminals of T are
the words of W with POStags, and the nodes of T are annotated with phrase headwords and
non-terminal labels.

Let W be a sentence of length n words to which we have prepended <s> and appended </s>
so that w_0 = <s> and w_{n+1} = </s>. Let W_k be the word k-prefix w_0 ... w_k of the sentence and
W_k T_k the word-parse k-prefix. Figure 1 shows a word-parse k-prefix; h_0 ... h_{-m} are the
exposed heads, each head being a pair (headword, non-terminal label), or (word, POStag) in the
case of a root-only tree.



h_{-m} = (<s>, SB)  ...  h_0 = (h_0.word, h_0.tag)

(<s>, SB) ... (w_r, t_r) ... (w_p, t_p) (w_{p+1}, t_{p+1}) ... (w_k, t_k) w_{k+1} ... </s>

Figure 1: A word-parse k-prefix 



1.1 Probabilistic Model 



The probability P(W,T) of a word sequence W and a complete parse T can be broken into: 

P(W,T) = \prod_{k=1}^{n+1} [ P(w_k / W_{k-1}T_{k-1}) \cdot P(t_k / W_{k-1}T_{k-1}, w_k) \cdot \prod_{i=1}^{N_k} P(p_i^k / W_{k-1}T_{k-1}, w_k, t_k, p_1^k \ldots p_{i-1}^k) ]

where: 

• W_{k-1}T_{k-1} is the word-parse (k-1)-prefix

This work was funded by the NSF IRI-19618874 grant STIMULATE.




Figure 2: Result of adjoin-left under NTlabel

Figure 3: Result of adjoin-right under NTlabel

• w_k is the word predicted by the WORD-PREDICTOR

• t_k is the tag assigned to w_k by the TAGGER

• N_k - 1 is the number of operations the PARSER executes at sentence position k before passing
control to the WORD-PREDICTOR (the N_k-th operation at position k is the null transition);
N_k is a function of T

• p_i^k denotes the i-th PARSER operation carried out at position k in the word string; the
operations performed by the PARSER are illustrated in Figures 2 and 3, and they ensure that all
possible binary branching parses with all possible headword and non-terminal label assignments
for the w_1 ... w_k word sequence can be generated.

Our model is based on three probabilities, estimated using deleted interpolation (see ||) and
parameterized as follows:

P(w_k / W_{k-1}T_{k-1}) = P(w_k / h_0, h_{-1})    (1)
P(t_k / w_k, W_{k-1}T_{k-1}) = P(t_k / w_k, h_0.tag, h_{-1}.tag)    (2)
P(p_i^k / W_k T_k) = P(p_i^k / h_0, h_{-1})    (3)



It is worth noting that if the binary branching structure developed by the parser were always 
right-branching and we mapped the POStag and non-terminal label vocabularies to a single 
type then our model would be equivalent to a trigram language model. 

Since the number of parses for a given word prefix W_k grows exponentially with k, |{T_k}| ~
O(2^k), the state space of our model is huge even for relatively short sentences, so we had to use
a search strategy that prunes it. Our choice was a synchronous multi-stack search algorithm
which is very similar to a beam search.

The probability assignment for the word at position k + 1 in the input sentence is made using: 
P(w_{k+1} / W_k) = \sum_{T_k \in S_k} P(w_{k+1} / W_k T_k) \cdot [ P(W_k T_k) / \sum_{T_k \in S_k} P(W_k T_k) ]    (4)



which ensures a proper probability over strings W*, where S_k is the set of all parses present in
our stacks at the current stage k. An N-best EM variant is employed to reestimate the model
parameters such that the PPL on training data is decreased, that is, the likelihood of the training
data under our model is increased. The reduction in PPL is shown experimentally to carry over
to the test data.
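The word-level probability in (4) can be illustrated with a short Python sketch. It assumes the parses in the stacks S_k are available as (identifier, log-probability) pairs and that a callable returns P(w_{k+1} / W_k T_k) for each parse; both the data layout and the names are hypothetical and chosen only for this example.

```python
import math


def next_word_probability(parses, word_given_parse):
    """Sketch of equation (4): mix per-parse word predictions.

    parses: list of (parse_id, log P(W_k T_k)) for the parses T_k in S_k
            (hypothetical representation).
    word_given_parse: hypothetical callable, parse_id -> P(w_{k+1} / W_k T_k).
    """
    # Subtract the largest log-probability before exponentiating so that
    # long prefixes do not underflow; the ratio in (4) is unchanged.
    max_log = max(log_p for _, log_p in parses)
    weights = [math.exp(log_p - max_log) for _, log_p in parses]
    total = sum(weights)
    # Each parse contributes its prediction, weighted by its normalized probability.
    return sum(
        (weight / total) * word_given_parse(parse_id)
        for (parse_id, _), weight in zip(parses, weights)
    )


# Two parses with probabilities 0.03 and 0.01, i.e. normalized weights 0.75 and 0.25.
example_parses = [("T_a", math.log(0.03)), ("T_b", math.log(0.01))]
example_predictions = {"T_a": 0.2, "T_b": 0.6}
p_next = next_word_probability(example_parses, example_predictions.get)
```

Because the weights are normalized over the parses that survive pruning, the mixture remains a proper distribution over the next word whenever each per-parse prediction is one.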



2 A* Decoder for Lattices 



The speech recognition lattice is an intermediate format in which the hypotheses produced by 
the first pass recognizer are stored. For each utterance we save a directed acyclic graph in which 
the nodes are a subset of the language model states in the composite hidden Markov model and 
the arcs — links — are labeled with words. Typically, the first pass acoustic/language model 
scores associated with each link in the lattice are saved and the nodes contain time alignment 
information. 

There are a couple of reasons that make A* [|J appealing for lattice decoding using the SLM: 

• the algorithm operates with whole prefixes, making it ideal for incorporating language models 
whose memory is the entire sentence prefix; 

• a reasonably good lookahead function and an efficient way to calculate it using dynamic 
programming techniques are both readily available using the n-gram language model. 

2.1 A* Algorithm 

Let a set of hypotheses L = {h : x_1, ..., x_n}, x_i ∈ W* ∀ i be organized as a prefix tree.
We wish to obtain the maximum scoring hypothesis under the scoring function f : W* → ℝ:
h* = argmax_{h ∈ L} f(h) without scoring all the hypotheses in L, if possible with a minimal
computational effort. The A* algorithm operates with prefixes and suffixes of hypotheses
(paths) in the set L; we will denote prefixes (anchored at the root of the tree) with x
and suffixes (anchored at a leaf) with y. A complete hypothesis h can be regarded as the
concatenation of an x prefix and a y suffix: h = x.y.

To be able to pursue the most promising path, the algorithm needs to evaluate all the possible
suffixes that are allowed in L for a given prefix x = w_1, ..., w_p (see Figure 4). Let C_L(x) be
the set of suffixes allowed by the tree for a prefix x and assume we have an overestimate for
the f(x.y) score of any complete hypothesis x.y: g(x.y) = f(x) + h(y|x) ≥ f(x.y). Imposing
that h(y|x) = 0 for empty y, we have g(x) = f(x), ∀ complete x ∈ L, that is, the overestimate
becomes exact for complete hypotheses h ∈ L. Let the A* ranking function g_L(x) be:


Figure 4: Prefix Tree Organization of a Set of Hypotheses L 



g_L(x) = \max_{y \in C_L(x)} g(x.y) = f(x) + h_L(x), where    (5)

h_L(x) = \max_{y \in C_L(x)} h(y|x)    (6)

g_L(x) is an overestimate for the f(·) score of any complete hypothesis that has the prefix x;
the overestimate becomes exact for complete hypotheses. The A* algorithm uses a potentially
infinite stack in which prefixes x are ordered in decreasing order of the A* ranking function
g_L(x); at each extension step the top-most prefix x = w_1, ..., w_p is popped from the stack,
expanded with all possible one-symbol continuations of x in L, and then all the resulting expanded
prefixes (among which there may be complete hypotheses as well) are inserted back into
the stack. The stopping condition is: whenever the popped hypothesis is a complete one, retain
it as the overall best hypothesis h*.
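The stack procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the decoder used in the experiments: hypotheses are tuples of symbols, no hypothesis is a proper prefix of another (each ends in an end-of-sentence symbol), and the caller supplies f(x) and the lookahead h_L(x) as functions. The names a_star, prefix_score and lookahead are hypothetical.

```python
import heapq


def a_star(hypotheses, prefix_score, lookahead):
    """Return (best hypothesis, score) over the prefix tree built from `hypotheses`.

    prefix_score: f(x) for a prefix x (a tuple of symbols).
    lookahead: h_L(x); must overestimate the best suffix score and be 0
               for complete hypotheses.
    """
    complete = set(hypotheses)
    # children[x] holds the one-symbol extensions of prefix x allowed by L.
    children = {}
    for hyp in complete:
        for i in range(len(hyp)):
            children.setdefault(hyp[:i], set()).add(hyp[:i + 1])

    # heapq is a min-heap, so ranks are negated to pop the largest g_L(x) first.
    root = ()
    stack = [(-(prefix_score(root) + lookahead(root)), root)]
    while stack:
        neg_rank, x = heapq.heappop(stack)
        if x in complete:
            # The first complete hypothesis popped is the overall best one.
            return x, -neg_rank
        for child in children.get(x, ()):
            heapq.heappush(stack, (-(prefix_score(child) + lookahead(child)), child))
    return None, float("-inf")


# Toy example: additive log-probability scores. Since every score is <= 0,
# a zero lookahead is a (loose) overestimate of any suffix score.
word_logp = {"the": -0.1, "a": -0.5, "cat": -0.3, "cap": -1.2, "</s>": 0.0}
L = [("the", "cat", "</s>"), ("the", "cap", "</s>"), ("a", "cat", "</s>")]
best, best_score = a_star(L, lambda x: sum(word_logp[w] for w in x), lambda x: 0.0)
```

A tighter lookahead reduces the number of prefixes expanded before the first complete hypothesis is popped; the zero lookahead used in the toy example is valid only because all per-word scores are non-positive.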

2.2 A* Lattice Rescoring 

A speech recognition lattice can be conceptually organized as a prefix tree of paths. When 
rescoring the lattice using a different language model than the one that was used in the first 
pass, we seek to find the complete path p = l_0 ... l_n maximizing:

f(p) = \sum_{i=0}^{n} [ \log P_{AM}(l_i) + LMweight \cdot \log P_{LM}(w(l_i) | w(l_0) \ldots w(l_{i-1})) - \log P_{IP} ]    (7)

where: 

• log P_AM(l_i) is the acoustic model log-likelihood assigned to link l_i;

• log P_LM(w(l_i) | w(l_0) ... w(l_{i-1})) is the language model log-probability assigned to link l_i given
the previous links on the partial path l_0 ... l_i;

• LMweight > 0 is a constant weight which multiplies the language model score of a link; its
theoretical justification is unclear but experiments show its usefulness;

• log P_IP > 0 is the "insertion penalty"; again, its theoretical justification is unclear but
experiments show its usefulness.
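As a rough illustration of the path score in (7), the Python sketch below accumulates the three per-link terms along a path. The function name, the (word, acoustic log-likelihood) link representation and the lm_logprob callable are assumptions made for this example; they are not part of the decoder described here.

```python
def path_score(links, lm_logprob, lm_weight, log_ip):
    """Sketch of equation (7) for one lattice path.

    links: list of (word, acoustic_logp) pairs, one per link l_i (hypothetical layout).
    lm_logprob: hypothetical callable (word, history_tuple) -> log P_LM(word | history).
    lm_weight: the LMweight constant (> 0).
    log_ip: the insertion penalty log P_IP (> 0), subtracted once per link.
    """
    total = 0.0
    history = []
    for word, acoustic_logp in links:
        # Acoustic score + weighted language model score - insertion penalty.
        total += acoustic_logp + lm_weight * lm_logprob(word, tuple(history)) - log_ip
        history.append(word)
    return total


# Toy language model whose log-probability depends only on the history length.
toy_lm = lambda word, history: -0.5 * (len(history) + 1)
toy_score = path_score([("a", -1.0), ("b", -2.0)], toy_lm, lm_weight=2.0, log_ip=0.25)
```

Passing the whole history to lm_logprob mirrors the fact that the rescoring model may condition on the entire sentence prefix rather than on a fixed n-gram window.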

To be able to apply the A* algorithm we need to find an appropriate stack entry scoring function
g_L(x) where x is a partial path and L is the set of complete paths in the lattice. Going back
to the definition (5) of g_L(·) we need an overestimate g(x.y) = f(x) + h(y|x) ≥ f(x.y) for all
possible y = l_k ... l_n complete continuations of x allowed by the lattice. We propose to use the
heuristic:

h(y|x) = \sum_{i=k}^{n} [ \log P_{AM}(l_i) + LMweight \cdot (\log P_{NG}(l_i) + \log P_{COMP}) - \log P_{IP} ] + LMweight \cdot \log P_{FINAL} \cdot \delta(k < n)    (8)

A simple calculation shows that if log P_LM(l_i) satisfies: log P_NG(l_i) + log P_COMP ≥ log P_LM(l_i), ∀ l_i
then g_L(x) = f(x) + max_{y ∈ C_L(x)} h(y|x) is an appropriate choice for the A* stack entry scoring
function. In practice one cannot maintain a potentially infinite stack. The log P_COMP and
log P_FINAL parameters controlling the quality of the overestimate in (8) are adjusted empirically.
A more detailed description of this procedure is precluded by the length limit on the article.
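One way the maximization over suffixes could be carried out with dynamic programming is sketched below in Python. It is an illustration under assumptions, not the procedure used in the experiments: the lattice is a dictionary mapping each node to its outgoing links, each link carries its acoustic and n-gram log scores, and the log P_FINAL term of (8) is left out for simplicity. All names are hypothetical.

```python
from functools import lru_cache


def best_suffix_bounds(lattice, final_node, lm_weight, log_comp, log_ip):
    """Backward dynamic programming over a lattice (a directed acyclic graph).

    lattice: dict node -> list of (next_node, acoustic_logp, ngram_logp)
             (hypothetical representation).
    Returns a dict node -> best summed per-link bound over all paths from
    that node to final_node (the summation part of the heuristic only).
    """

    @lru_cache(maxsize=None)
    def bound(node):
        if node == final_node:
            return 0.0  # empty suffix: the lookahead is zero
        # Best over outgoing links of (per-link bound + best bound from the next node).
        return max(
            (
                am + lm_weight * (ng + log_comp) - log_ip + bound(next_node)
                for next_node, am, ng in lattice.get(node, ())
            ),
            default=float("-inf"),  # dead end: no path reaches the final node
        )

    return {node: bound(node) for node in set(lattice) | {final_node}}


toy_lattice = {
    0: [(1, -1.0, -2.0), (2, -3.0, -0.5)],
    1: [(3, -1.0, -1.0)],
    2: [(3, -0.5, -0.5)],
}
bounds = best_suffix_bounds(toy_lattice, final_node=3, lm_weight=1.0, log_comp=0.5, log_ip=0.0)
```

Each node's bound is computed once and reused for every prefix that ends at that node, which is what makes the lookahead cheap to evaluate during the stack search.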

3 Experiments 

As a first step we evaluated the perplexity performance of the SLM relative to that of a baseline
deleted interpolation 3-gram model trained under the same conditions: training data size 5Mwds
(section 89 of WSJ0), vocabulary size 65kwds, closed over test set. We have linearly interpolated
the SLM with the 3-gram model: P(·) = λ · P_3gram(·) + (1 - λ) · P_SLM(·), showing a 16%
relative reduction in perplexity; the interpolation weight was determined on a held-out set
to be λ = 0.4. A second batch of experiments evaluated the performance of the SLM for
trigram lattice decoding¹. The results are presented in Table 1. The SLM achieved an absolute
improvement in WER of 1% (10% relative) over the lattice 3-gram baseline; the improvement is
statistically significant at the 0.0008 level according to a sign test. As a by-product, the WER
performance of the structured language model on 10-best list rescoring was 9.9%.

            Trigram + SLM
λ           0.0     0.4     1.0
PPL         116     109     130
            Lattice Trigram + SLM
WER         11.5    9.6     10.6

Table 1: Test Set Perplexity and Word Error Rate Results
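The interpolation formula and the relative reductions quoted above can be checked with a small Python sketch. The helper names are hypothetical; the per-word probabilities passed to perplexity are whatever the interpolated model assigns to the test words.

```python
import math


def interpolate(p_3gram, p_slm, lam):
    """Linear interpolation: lam * P_3gram + (1 - lam) * P_SLM for one word."""
    return lam * p_3gram + (1.0 - lam) * p_slm


def perplexity(word_probs):
    """Perplexity of a test text given the probability assigned to each word."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))


def relative_reduction(baseline, new):
    """Relative reduction of `new` with respect to `baseline`."""
    return (baseline - new) / baseline


# With lam = 1.0 the formula reduces to the 3-gram alone (PPL 130 in Table 1);
# the interpolated model at lam = 0.4 reaches PPL 109.
ppl_reduction = relative_reduction(130, 109)   # about 0.16
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

The same relative_reduction helper applies to word error rates, for example to the 10.6 and 9.6 figures in Table 1.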

4 Experiments: ERRATA 

We repeated the WSJ lattice rescoring experiments reported in in a standard setup. We 
chose to work on the DARPA'93 evaluation HUB1 test set — 213 utterances, 3446 words. The 
20kwds open vocabulary and baseline 3-gram model are the standard ones provided by NIST. 

As a first step we evaluated the perplexity performance of the SLM relative to that of 
a deleted interpolation 3-gram model trained under the same conditions: training data size 
20Mwds (a subset of the training data used for the baseline 3-gram model), standard HUB1 
open vocabulary of size 20kwds; both the training data and the vocabulary were re-tokenized 
such that they conform to the UPenn Treebank tokenization. We have linearly interpolated the
SLM with the above 3-gram model:

P(\cdot) = \lambda \cdot P_{3gram}(\cdot) + (1 - \lambda) \cdot P_{SLM}(\cdot)

showing a 10% relative reduction over the perplexity of the 3-gram model. The results are
presented in Table 2. The SLM parameter reestimation procedure^] reduces the PPL by 5%
(2% after interpolation with the 3-gram model). The main reduction in PPL comes, however, from
the interpolation with the 3-gram model, showing that although overlapping, the two models
successfully complement each other. The interpolation weight was determined on a held-out
set to be λ = 0.4. Both language models operate in the UPenn Treebank text tokenization.

¹ The lattices were generated using a language model trained on 45Mwds and using a 5kwds vocabulary
closed over the test data.

                                     Trigram(20Mwds) + SLM
λ                                    0.0     0.4     1.0
PPL, initial SLM, iteration 0        152     136     148
PPL, reestimated SLM, iteration 1    144     133     148

Table 2: Test Set Perplexity Results

A second batch of experiments evaluated the performance of the SLM for 3-gram[| lattice 
decoding. The lattices were generated using the standard baseline 3-gram language model 
trained on 40Mwds and using the standard 20kwds open vocabulary. The best achievable 
WER on these lattices was measured to be 3.3%, leaving a large margin for improvement over 
the 13.7% baseline WER. 

For the lattice rescoring experiments we have adjusted the operation of the SLM such that 
it assigns probability to word sequences in the CSR tokenization and thus the interpolation 
between the SLM and the baseline 3-gram model becomes valid. The results are presented in
Table 3. The SLM achieved an absolute improvement in WER of 0.7% (5% relative) over the
baseline despite the fact that it used half the amount of training data used by the baseline 
3-gram model. Training the SLM does not yield an improvement in WER when interpolating 
with the 3-gram model, although it improves the performance of the SLM by itself. 

                                     Lattice Trigram(40Mwds) + SLM
λ                                    0.0     0.4     1.0
WER, initial SLM, iteration 0        14.4    13.0    13.7
WER, reestimated SLM, iteration 1    14.3    13.2    13.7

Table 3: Test Set Word Error Rate Results